@tjungblu
Contributor

@tjungblu tjungblu commented Aug 28, 2025

When one member timed out, the other members were also declared unhealthy because the shared timeout context was cancelled.
This change gives each member health check its own context with an individual timeout.

With this fix, only the affected member is considered unhealthy:

[etcd-operator-566ff8dd4-dn5qx] E0828 11:20:04.415537       1 health.go:120] health check for member (tjungblu15-dq6nb-master-0) failed: err(context deadline exceeded)
[etcd-operator-566ff8dd4-dn5qx] W0828 11:20:04.415708       1 etcdcli.go:356] UnhealthyEtcdMember found: [tjungblu15-dq6nb-master-0]
...
[etcd-operator-566ff8dd4-dn5qx] E0828 11:21:01.300834       1 base_controller.go:279] "Unhandled Error" err="DefragController reconciliation failed: cluster is unhealthy: 2 of 3 members are available, tjungblu15-dq6nb-master-0 is unhealthy"


When one member is timing out, the others were also declared unhealthy
due to the shared timeout being cancelled.
This adds a new context with an individual timeout to each member health
check.

Signed-off-by: Thomas Jungblut <[email protected]>
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference (Indicates that this PR references a valid Jira ticket of any type.) and jira/invalid-bug (Indicates that a referenced Jira bug is invalid for the branch this PR is targeting.) labels Aug 28, 2025
@openshift-ci-robot

@tjungblu: This pull request references Jira Issue OCPBUGS-60941, which is invalid:

  • expected the bug to target the "4.20.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci bot requested review from dusk125 and jubittajohn August 28, 2025 10:52
@openshift-ci openshift-ci bot added the approved label (Indicates a PR has been approved by an approver from all required OWNERS files.) Aug 28, 2025
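// Each member gets its own timeout context, so a timeout on one member does not affect the other checks.
memberCtx, cancel := context.WithTimeout(ctx, DefaultClientTimeout)
defer cancel()

memberHealth[i] = checkSingleMemberHealth(memberCtx, cli, member)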
Contributor

@lance5890 lance5890 Aug 28, 2025

There may be a data race in the goroutine: it captures the loop variable member by reference, so all goroutines could end up checking the last member. We should pass member as an argument: go func(i int, m *etcdserverpb.Member) {...}(i, member)

Contributor Author

Thanks for the quick review, @lance5890. I think this is fixed in later Go versions (per-iteration loop variables, starting with Go 1.22)?
https://go.dev/blog/loopvar-preview
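
For reference, here is a small sketch of the explicit-argument pattern suggested in the review. Since Go 1.22, modules that declare `go 1.22` or later in go.mod get a fresh loop variable per iteration, so the closure capture is safe there. Passing the values as arguments behaves the same on older and newer versions. The `collect` function and the member names are hypothetical and only illustrate the pattern.

```go
package main

import (
	"fmt"
	"sync"
)

// collect starts one goroutine per name and passes the loop values in as
// arguments, so each goroutine works on its own copy on any Go version.
func collect(names []string) []string {
	seen := make([]string, len(names))
	var wg sync.WaitGroup
	for i, name := range names {
		wg.Add(1)
		go func(i int, name string) {
			defer wg.Done()
			seen[i] = name // each goroutine writes to its own slot only
		}(i, name)
	}
	wg.Wait()
	return seen
}

func main() {
	fmt.Println(collect([]string{"master-0", "master-1", "master-2"}))
}
```

Each goroutine writes to a distinct slice index, so the result preserves the input order without extra locking.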

@dusk125
Contributor

dusk125 commented Aug 28, 2025

/lgtm

@openshift-ci openshift-ci bot added the lgtm label (Indicates that a PR is ready to be merged.) Aug 28, 2025
@openshift-ci
Contributor

openshift-ci bot commented Aug 28, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dusk125, tjungblu

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@tjungblu
Contributor Author

/override ci/prow/e2e-aws-cpms
/override ci/prow/e2e-aws-ovn-etcd-scaling
/override ci/prow/e2e-metal-assisted

@tjungblu
Contributor Author

/label acknowledge-critical-fixes-only

@openshift-ci openshift-ci bot added the acknowledge-critical-fixes-only label (Indicates if the issuer of the label is OK with the policy.) Aug 28, 2025
@openshift-ci
Contributor

openshift-ci bot commented Aug 28, 2025

@tjungblu: Overrode contexts on behalf of tjungblu: ci/prow/e2e-aws-cpms, ci/prow/e2e-aws-ovn-etcd-scaling, ci/prow/e2e-metal-assisted

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@openshift-ci
Contributor

openshift-ci bot commented Aug 28, 2025

@tjungblu: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/e2e-aws-disruptive | cd42ba7 | link | false | /test e2e-aws-disruptive |
| ci/prow/e2e-gcp-disruptive | cd42ba7 | link | false | /test e2e-gcp-disruptive |
| ci/prow/e2e-metal-ovn-two-node-fencing | cd42ba7 | link | false | /test e2e-metal-ovn-two-node-fencing |
| ci/prow/e2e-azure-ovn-etcd-scaling | cd42ba7 | link | false | /test e2e-azure-ovn-etcd-scaling |
| ci/prow/e2e-gcp-ovn-etcd-scaling | cd42ba7 | link | false | /test e2e-gcp-ovn-etcd-scaling |
| ci/prow/e2e-vsphere-ovn-etcd-scaling | cd42ba7 | link | false | /test e2e-vsphere-ovn-etcd-scaling |
| ci/prow/e2e-gcp-disruptive-ovn | cd42ba7 | link | false | /test e2e-gcp-disruptive-ovn |
| ci/prow/e2e-metal-ovn-sno-cert-rotation-shutdown | cd42ba7 | link | false | /test e2e-metal-ovn-sno-cert-rotation-shutdown |
| ci/prow/e2e-metal-ovn-ha-cert-rotation-shutdown | cd42ba7 | link | false | /test e2e-metal-ovn-ha-cert-rotation-shutdown |
| ci/prow/e2e-aws-disruptive-ovn | cd42ba7 | link | false | /test e2e-aws-disruptive-ovn |
| ci/prow/e2e-aws-etcd-recovery | cd42ba7 | link | false | /test e2e-aws-etcd-recovery |
| ci/prow/e2e-aws-etcd-certrotation | cd42ba7 | link | false | /test e2e-aws-etcd-certrotation |

Full PR test history. Your PR dashboard.


@tjungblu
Contributor Author

/jira refresh

@openshift-ci-robot

@tjungblu: This pull request references Jira Issue OCPBUGS-60941, which is invalid:

  • expected the bug to target the "4.20.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.


@tjungblu
Contributor Author

/jira refresh

@openshift-ci-robot openshift-ci-robot added the jira/valid-bug label (Indicates that a referenced Jira bug is valid for the branch this PR is targeting.) and removed the jira/invalid-bug label (Indicates that a referenced Jira bug is invalid for the branch this PR is targeting.) Aug 28, 2025
@openshift-ci-robot

@tjungblu: This pull request references Jira Issue OCPBUGS-60941, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.20.0) matches configured target version for branch (4.20.0)
  • bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @geliu2016


@openshift-ci openshift-ci bot requested a review from geliu2016 August 28, 2025 15:16
@openshift-merge-bot openshift-merge-bot bot merged commit 9091149 into openshift:main Aug 28, 2025
22 of 34 checks passed
@openshift-ci-robot

@tjungblu: Jira Issue OCPBUGS-60941: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-60941 has been moved to the MODIFIED state.


@tjungblu tjungblu deleted the main branch August 28, 2025 15:22
@tjungblu
Contributor Author

/cherry-pick release-4.19

@openshift-cherrypick-robot

@tjungblu: new pull request created: #1475



Labels

acknowledge-critical-fixes-only: Indicates if the issuer of the label is OK with the policy.
approved: Indicates a PR has been approved by an approver from all required OWNERS files.
jira/valid-bug: Indicates that a referenced Jira bug is valid for the branch this PR is targeting.
jira/valid-reference: Indicates that this PR references a valid Jira ticket of any type.
lgtm: Indicates that a PR is ready to be merged.
